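A variety of performance metrics are commonly used in the machine learning literature to evaluate classification systems that output categorical decisions. Some of the most common are accuracy, total error (one minus accuracy), balanced accuracy, balanced total error (one minus balanced accuracy), the F-score, and the Matthews correlation coefficient (MCC). In this document, we review the definitions of these metrics and compare them with the expected cost (EC), a metric that is introduced in every statistical learning course but rarely used in the machine learning literature. We show that the empirical estimate of the EC is a generalized version of both the total error and the balanced total error. Further, we show its relation to the F-score and the MCC and argue that the EC is superior to them, being more general, simpler, more intuitive, and better motivated. We highlight some issues with the F-score and the MCC that make them suboptimal metrics. While it is not explained in the current version of this manuscript, which focuses exclusively on metrics computed over hard decisions, the EC has the additional advantage of being a good tool for measuring the calibration of a system's scores, and it allows users to make optimal decisions given a set of posteriors for each class. We leave that discussion for a future version of this manuscript.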
translated by Google Translate
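The claim that the empirical EC generalizes both the total error and the balanced total error can be checked with a small sketch. It assumes the usual textbook form of the EC, a sum over true classes i and decisions j of prior(i) × cost(i, j) × (fraction of class-i samples that received decision j). The function name `expected_cost`, its parameters, and the confusion matrix below are illustrative and do not come from the abstract.

```python
def expected_cost(confusion, costs=None, priors=None):
    """Empirical expected cost from a confusion matrix (illustrative sketch).

    confusion[i][j] counts samples of true class i that received decision j.
    costs[i][j] is the cost of deciding j when the true class is i
    (default: 0 on the diagonal, 1 elsewhere).
    priors[i] is the class prior used for weighting
    (default: the empirical class frequencies).
    """
    n = len(confusion)
    class_counts = [sum(row) for row in confusion]
    total = sum(class_counts)
    if costs is None:
        costs = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    if priors is None:
        priors = [count / total for count in class_counts]
    ec = 0.0
    for i in range(n):
        for j in range(n):
            # Fraction of class-i samples for which decision j was made.
            rate_ij = confusion[i][j] / class_counts[i]
            ec += priors[i] * costs[i][j] * rate_ij
    return ec


# Hypothetical two-class confusion matrix (rows: true class, columns: decision).
confusion = [[90, 10],
             [5, 15]]

# 0-1 costs with empirical priors give the total error: 15 / 120.
total_error = expected_cost(confusion)

# 0-1 costs with uniform priors give the balanced total error.
balanced_error = expected_cost(confusion, priors=[0.5, 0.5])
```

Under these assumptions, changing `costs` or `priors` yields other operating points without changing the formula, which is the sense in which the EC is the more general metric.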
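Spoken language recognition (SLR) refers to the automatic process used to determine the language present in a speech sample. SLR is an important task in its own right, for example, as a tool to analyze or categorize large amounts of multilingual data. It is also an essential tool for selecting downstream applications in a workflow, for example, to choose the appropriate speech recognition or machine translation model. SLR systems are usually composed of two stages: one in which an embedding representing the audio sample is extracted, and a second one that computes the final scores for each language. In this work, we approach the SLR task as a detection problem and implement the second stage as a probabilistic linear discriminant analysis (PLDA) model. We show that discriminative training of the PLDA parameters gives large gains with respect to the usual generative training. Further, we propose a novel hierarchical approach in which two PLDA models are trained, one to generate scores for clusters of highly related languages and a second one to generate scores conditional on each cluster. The final language detection scores are computed as a combination of these two sets of scores. The complete model is trained discriminatively to optimize a cross-entropy objective. We show that this hierarchical approach consistently outperforms the non-hierarchical one for detection of highly related languages, in many cases by large margins. We train our systems on a collection of datasets including 100 languages and test them under both matched and mismatched conditions, showing that the gains are robust to condition mismatch.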
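This work aims to understand the impact of class imbalance on the performance of chest X-ray classifiers, in light of the standard evaluation practices adopted by researchers in terms of discrimination and calibration performance. First, we conducted a literature study to analyze common scientific practices and confirmed that: (1) even when dealing with highly imbalanced datasets, the community tends to use metrics that are dominated by the majority class; and (2) it is still uncommon to include calibration studies for chest X-ray classifiers, despite their importance in the context of healthcare. Second, we performed a systematic experiment on two major chest X-ray datasets to explore the behavior of several performance metrics under different class ratios, and we show that widely adopted metrics can conceal the performance on the minority class. Finally, we propose the adoption of two alternative metrics, the precision-recall curve and the balanced Brier score, which better reflect the performance of the system in such scenarios. Our results indicate that the current evaluation practices adopted by the chest X-ray classifier research community may not reflect the performance of computer-aided diagnosis systems in real clinical scenarios, and we suggest alternatives to improve this situation.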
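To illustrate why a majority-dominated metric can hide poor minority-class performance, the sketch below compares the standard Brier score with one possible balanced variant. The abstract does not define the balanced Brier score, so the definition used here (the unweighted mean of the per-class Brier scores) is an assumption, and the function names and toy data are hypothetical.

```python
def brier_score(labels, probs):
    """Standard Brier score for binary labels (0/1) and predicted P(class 1)."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def balanced_brier_score(labels, probs):
    """Assumed balanced variant: average of the Brier scores computed
    separately on the positive and on the negative samples."""
    pos = [(1.0 - p) ** 2 for y, p in zip(labels, probs) if y == 1]
    neg = [p ** 2 for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    return 0.5 * (sum(pos) / len(pos) + sum(neg) / len(neg))


# Toy imbalanced set: nine negatives, one positive, and a system that
# always outputs a low probability for the positive class.
labels = [0] * 9 + [1]
probs = [0.1] * 10

standard = brier_score(labels, probs)           # dominated by the negatives
balanced = balanced_brier_score(labels, probs)  # exposes the missed positive
```

In this toy case the standard score looks good because nine of the ten samples are easy negatives, while the balanced variant gives the single badly scored positive half of the weight.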
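Phone-level pronunciation scoring is a challenging task, with performance still far from that of human annotators. Standard systems generate a score for each phone using models trained for automatic speech recognition (ASR) with native data only. Better performance has been shown when using systems that are trained specifically for the task using non-native data. Yet such systems face the challenge that datasets labelled for this task are scarce and usually small. In this paper, we present a transfer learning-based approach that leverages a model trained for ASR, adapting it to the task of pronunciation scoring. We analyze the effect of several design choices and compare the performance with a state-of-the-art goodness of pronunciation (GOP) system. Our final system outperforms the GOP system on EpaDB, a database for pronunciation scoring research, for a cost function that prioritizes low rates of unnecessary corrections.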
Machine learning models have been found to learn shortcuts -- unintended decision rules that are unable to generalize -- undermining models' reliability. Previous works address this problem under the tenuous assumption that only a single shortcut exists in the training data. Real-world images are rife with multiple visual cues from background to texture. Key to advancing the reliability of vision systems is understanding whether existing methods can overcome multiple shortcuts or struggle in a Whac-A-Mole game, i.e., where mitigating one shortcut amplifies reliance on others. To address this shortcoming, we propose two benchmarks: 1) UrbanCars, a dataset with precisely controlled spurious cues, and 2) ImageNet-W, an evaluation set based on ImageNet for watermark, a shortcut we discovered affects nearly every modern vision model. Along with texture and background, ImageNet-W allows us to study multiple shortcuts emerging from training on natural images. We find computer vision models, including large foundation models -- regardless of training set, architecture, and supervision -- struggle when multiple shortcuts are present. Even methods explicitly designed to combat shortcuts struggle in a Whac-A-Mole dilemma. To tackle this challenge, we propose Last Layer Ensemble, a simple-yet-effective method to mitigate multiple shortcuts without Whac-A-Mole behavior. Our results surface multi-shortcut mitigation as an overlooked challenge critical to advancing the reliability of vision systems. The datasets and code are released: https://github.com/facebookresearch/Whac-A-Mole.git.
This project explores the feasibility of remote patient monitoring based on the analysis of 3D movements captured with smartwatches. We base our analysis on the Kinematic Theory of Rapid Human Movement. We have validated our research in a real case scenario for stroke rehabilitation at the Guttmann Institute (neurorehabilitation hospital), showing promising results. Our work could have a great impact in remote healthcare applications, improving the medical efficiency and reducing the healthcare costs. Future steps include more clinical validation, developing multi-modal analysis architectures (analysing data from sensors, images, audio, etc.), and exploring the application of our technology to monitor other neurodegenerative diseases.
Assessing the physical condition in rehabilitation scenarios is a challenging problem, since it involves Human Activity Recognition (HAR) and kinematic analysis methods. In addition, the difficulties increase in unconstrained rehabilitation scenarios, which are much closer to the real use cases. In particular, our aim is to design an upper-limb assessment pipeline for stroke patients using smartwatches. We focus on the HAR task, as it is the first part of the assessing pipeline. Our main target is to automatically detect and recognize four key movements inspired by the Fugl-Meyer assessment scale, which are performed in both constrained and unconstrained scenarios. In addition to the application protocol and dataset, we propose two detection and classification baseline methods. We believe that the proposed framework, dataset and baseline results will serve to foster this research field.
Developing robust and fair AI systems requires datasets with a comprehensive set of labels that can help ensure the validity and legitimacy of relevant measurements. Recent efforts, therefore, focus on collecting person-related datasets that have carefully selected labels, including sensitive characteristics, and consent forms in place to use those attributes for model testing and development. Responsible data collection involves several stages, including but not limited to determining use-case scenarios, selecting categories (annotations) such that the data are fit for the purpose of measuring algorithmic bias for subgroups, and, most importantly, ensuring that the selected categories/subcategories are robust to regional diversities and inclusive of as many subgroups as possible. Meta, in a continuation of our efforts to measure AI algorithmic bias and robustness (https://ai.facebook.com/blog/shedding-light-on-fairness-in-ai-with-a-new-data-set), is working on collecting a large consent-driven dataset with a comprehensive list of categories. This paper describes our proposed design of such categories and subcategories for Casual Conversations v2.
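Compared to a single IMU, multiple rigidly connected inertial measurement unit (IMU) sensors provide a richer data stream. State-of-the-art methods follow a probabilistic model of IMU measurements, based on the random nature of errors, combined under a Bayesian framework. However, affordable low-grade IMUs additionally suffer from systematic errors caused by imperfections that are not covered by the corresponding probabilistic models. In this paper, we propose Best Axes Composition (BAC), a method for combining data from multiple IMU (MIMU) sensors for accurate 3D pose estimation. It accounts for both random and systematic errors by dynamically choosing the best IMU axes from the set of all available axes. We evaluate our approach on a MIMU visual-inertial sensor and compare its performance with a state-of-the-art method for MIMU data fusion. We show that BAC outperforms the latter and achieves up to 20% accuracy improvement for both orientation and position estimation in open loop, but needs proper treatment to keep the obtained gain.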
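Unlike 2D raster images, there is no single dominant representation for 3D visual data processing. Different formats such as point clouds, meshes, or implicit functions each have their strengths and weaknesses. Still, grid representations such as signed distance functions have attractive properties in 3D as well. In particular, they offer constant-time random access and are eminently suitable for modern machine learning. Unfortunately, the storage size of a grid grows exponentially with its dimension. Hence, grids often exceed memory limits even at moderate resolution. This work explores various low-rank tensor formats, including the Tucker, tensor train, and quantics tensor train decompositions, to compress time-varying 3D data. Our method iteratively computes, voxelizes, and compresses each frame's truncated signed distance function and applies tensor rank truncation to condense all frames into a single compressed tensor that represents an entire 4D scene. We show that low-rank tensor compression is extremely compact for storing and querying time-varying signed distance functions. It significantly reduces the memory footprint of 4D scenes while, surprisingly, preserving their geometric quality. Unlike existing iterative learning-based approaches such as DeepSDF and NeRF, our method uses a closed-form algorithm with theoretical guarantees.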
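The exponential growth of grid storage, and the saving offered by low-rank formats, can be made concrete with the standard parameter counts for these decompositions. The symbols $n$ (grid size per mode), $d$ (number of modes), and $r$ (a uniform rank), as well as the numbers plugged in below, are illustrative assumptions and are not taken from the abstract.

```latex
% Number of stored values for a d-way grid with n samples per mode,
% assuming a uniform rank r in every mode:
\begin{align*}
  \text{dense grid:}   \quad & n^{d} \\
  \text{Tucker:}       \quad & r^{d} + d\,n\,r
      && \text{(core tensor plus $d$ factor matrices)} \\
  \text{tensor train:} \quad & \le d\,n\,r^{2}
      && \text{($d$ cores of size at most $r \times n \times r$)}
\end{align*}
% Illustrative 4D scene with n = 256, d = 4, r = 32:
\begin{align*}
  n^{d}            &= 256^{4} = 4\,294\,967\,296, \\
  r^{d} + d\,n\,r  &= 32^{4} + 4 \cdot 256 \cdot 32 = 1\,081\,344, \\
  d\,n\,r^{2}      &= 4 \cdot 256 \cdot 32^{2} = 1\,048\,576 .
\end{align*}
```

With these assumed values, both formats store roughly four thousand times fewer numbers than the dense grid. The Tucker core still grows as $r^{d}$, whereas the tensor train count grows only linearly in $d$.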